Abstract: In this article we contribute to the problem of segmentation with plug-in procedures. We give general sufficient conditions under which plug-in procedures are efficient, together with an algorithm that satisfies these conditions. We apply this algorithm to hyperspectral image segmentation. Hyperspectral images are images that have both spatial and spectral coherence, with thousands of spectral bands on each pixel. The proposed procedure combines a dimension reduction technique with a spatial regularisation technique. This regularisation is based on the mixlet model of Kolaczyk et al. [9].
1. Introduction

In this article we study the segmentation problem, a particular learning problem that generalizes classification (as defined in [5]) by asking for multiple simultaneous decisions instead of a single decision. In segmentation, we have an observation $x = (x[1], \dots, x[N])$ in $\mathcal{X}^N$ (in this paper, $\mathcal{X} = \mathbb{R}^p$, but some results are more general). This observation is associated with a label $y = (y[1], \dots, y[N])$ with values in the product space $\{0,1\}^N$. We restrict ourselves to binary segmentation mainly to simplify the theoretical study; in the applications of this paper, $y$ takes values in $\{1, \dots, K\}^N$. In the segmentation problem, the label $y$ is unknown and a segmentation procedure is a function $g : \mathcal{X}^N \to \{0,1\}^N$ that tries to guess the correct label associated with a given observation. For example, in grayscale image segmentation, $N$ is the number of pixels in the image and $\mathcal{X} = \mathbb{R}$; in the hyperspectral image segmentation problem, $\mathcal{X} = \mathbb{R}^p$ with $p$ very large. The segmentation error of a segmentation procedure $g$ can be measured with a distance $d$ on $\{0,1\}^N$ by $d(g(X), Y)$. In this article we use the normalized Hamming distance $d_H$ defined by

\[ \forall x, y \in \{0,1\}^N, \quad d_H(x, y) = \frac{1}{N} \sum_{i=1}^{N} 1_{x[i] \neq y[i]}. \]

The value of $d_H(g(X), Y)$ represents the proportion of misclassified pixels.
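The normalized Hamming distance used as segmentation error is simply the fraction of pixels on which two label vectors disagree. A minimal sketch in Python with NumPy (the function name `normalized_hamming` and the toy labels are illustrative, not from the paper):

```python
import numpy as np

def normalized_hamming(x, y):
    """Proportion of positions where two label vectors differ."""
    x = np.asarray(x)
    y = np.asarray(y)
    if x.shape != y.shape:
        raise ValueError("label vectors must have the same shape")
    return float(np.mean(x != y))

# Toy example: 2 of the 5 pixels are misclassified.
labels_true = np.array([0, 1, 1, 0, 1])
labels_pred = np.array([0, 1, 0, 0, 0])
error = normalized_hamming(labels_pred, labels_true)
print(error)
```

The same function applies unchanged to labels in $\{1, \dots, K\}$, which is the setting of the applications.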
In order to analyze the theoretical performance of the proposed procedures, we introduce a probabilistic setting and let $(X, Y)$ be an $\mathcal{X}^N \times \{0,1\}^N$-valued random pair, modeling an observation and the corresponding label. Let $P_X$ be the distribution of $X$, $P_{X_i}$ the distribution of $X[i]$ for all $i \in \{1, \dots, N\}$, and $P_Y$ the distribution of $Y$. For $k = 0, 1$ and $i = 1, \dots, N$, let $P_{ik}$ be the probability distribution of $X[i]$ given $Y[i] = k$, and let $\pi_i = P_Y(Y[i] = 0)$. The distribution of the random pair $(X, Y)$ may be described by $((P_{ik})_{i,k}, (\pi_i)_i)$. In this article, we make the following assumption.

Assumption 1. For all $i, j \in \{1, \dots, N\}$, $i \neq j$, the random variables $X[j]$ and $Y[i]$ are independent.

We measure the performance of a segmentation procedure $g$ by $E[d_H(g(X), Y)]$, and it is easy to see that, under Assumption 1, the optimal procedure, i.e. the one that minimizes $E[d_H(g(X), Y)]$, is given by

\[ \forall i \in \{1, \dots, N\}, \quad g^*(X)[i] = 1_{1 \le R_\pi[i]}, \qquad R_\pi[i] = \frac{1 - \pi_i}{\pi_i} \, \frac{dP_{i1}}{dP_{i0}}(X[i]). \tag{1} \]

The two-step framework. In a plug-in perspective, the construction of a segmentation procedure approaching $g^*$ can be divided into two steps.

• Step 1: Learning step. Find substitutes $(\hat P_{i0}, \hat P_{i1})_{i=1,\dots,N}$ for the conditional distributions on each pixel $(P_{i0}, P_{i1})_{i=1,\dots,N}$.

• Step 2: Segmentation step. Find $\hat P_X \in \times_{i=1}^{N} \mathrm{Conv}(\hat P_{i0}, \hat P_{i1})$ (see footnote 1) using the observation $X$ drawn from $P_X$ (the observed image). Note that finding such a distribution is equivalent to finding a set of weights $\pi(\hat P_X) = (\pi_i(\hat P_X))_{i=1,\dots,N} \in [0,1]^N$ with

\[ \hat P_X = \prod_{i=1}^{N} \left( \pi_i(\hat P_X) \hat P_{i0} + (1 - \pi_i(\hat P_X)) \hat P_{i1} \right). \tag{2} \]

The (plug-in) segmentation procedure obtained with this construction is

\[ \forall i \in \{1, \dots, N\}, \quad g(X)[i] = 1_{1 \le \hat R_\pi[i]}, \qquad \hat R_\pi[i] = \frac{1 - \pi_i(\hat P_X)}{\pi_i(\hat P_X)} \, \frac{d\hat P_{i1}}{d\hat P_{i0}}(X[i]). \tag{3} \]

Remark 1. The segmentation rule depends on the observation $X[i]$ through the evaluation of the likelihood ratio $\frac{d\hat P_{i1}}{d\hat P_{i0}}(X[i])$, but it also depends on the whole image $X$ through the evaluation of the weight vector $\pi(\hat P_X)$ in step 2.

In the applications of this article, a learning set, composed of $n$ independent random variables drawn from $P_{i0}$ and $P_{i1}$ for all $i = 1, \dots, N$, is given in the first step. This is what we refer to as supervised segmentation, and this justifies the name of the first step.

1 $\mathrm{Conv}(\hat P_{i0}, \hat P_{i1})$ is the convex hull of $\{\hat P_{i0}, \hat P_{i1}\}$.
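To make the plug-in rule concrete, the following Python sketch applies a rule of the form (3) once the two steps have produced their outputs. It assumes, purely for illustration, Gaussian class densities with a common diagonal covariance that does not depend on the pixel; the function names, the clipping constant and the toy values are hypothetical.

```python
import numpy as np

def gaussian_log_ratio(X, mu0, mu1, var):
    """log(dP1/dP0)(X[i]) per pixel for N(mu_k, diag(var)) class densities."""
    X = np.asarray(X, dtype=float)
    # With a shared diagonal covariance the log-ratio is linear in x.
    w = (mu1 - mu0) / var
    b = 0.5 * np.sum((mu0 ** 2 - mu1 ** 2) / var)
    return X @ w + b

def plugin_segmentation(X, pi_hat, mu0, mu1, var):
    """Label 1 where (1 - pi_i) / pi_i * dP1/dP0(X[i]) >= 1, else label 0."""
    # pi_hat[i] is the estimated weight of class 0 at pixel i (output of step 2).
    pi_hat = np.clip(np.asarray(pi_hat, dtype=float), 1e-12, 1 - 1e-12)
    log_prior_ratio = np.log1p(-pi_hat) - np.log(pi_hat)
    log_r = log_prior_ratio + gaussian_log_ratio(X, mu0, mu1, var)
    return (log_r >= 0).astype(int)

mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 0.0])
var = np.array([1.0, 1.0])
X = np.array([[0.1, 0.3], [1.9, -0.2], [1.2, 0.0]])  # three pixels, p = 2

labels_flat = plugin_segmentation(X, [0.5, 0.5, 0.5], mu0, mu1, var)
labels = plugin_segmentation(X, [0.5, 0.5, 0.9], mu0, mu1, var)
print(labels_flat, labels)
```

The third pixel changes label when its weight moves from 0.5 to 0.9, which illustrates Remark 1: the decision at a pixel depends on the local likelihood ratio and on the weight estimated from the whole image.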
Obtained rate of convergence. In order to measure the difference between a segmentation procedure $g$ and the best segmentation procedure $g^*$, it is convenient to introduce the excess risk:

\[ S(g) = E[d_H(g(X), Y)] - E[d_H(g^*(X), Y)]. \tag{4} \]

In this article, we give rates for the convergence of $S(g)$ to zero. The procedure we describe in Section 2 for the estimation of the weights $\pi(P_X)$ is a general model selection procedure introduced by Kolaczyk et al. [9]. The algorithm in step 2 has never been used before, but it is in line with step (c) in the work of Antoniadis et al. [1]. Apart from our numerical studies and the fact that we combine step 1 and step 2, our main contribution is theoretical. Indeed, we obtain rates of convergence for a class of plug-in segmentation procedures. The corresponding results are summarized in Theorem 2. It relates the convergence of $S(g)$ separately to the choice of $(\hat P_{i0}, \hat P_{i1})_{i=1,\dots,N}$ and to the complexity of the class of possible weights $\pi(\hat P_X)$.

As an example, when $(P_{ik})_{i=1,\dots,N,\,k=0,1}$ are Gaussian, the procedure $g$ we give satisfies $S(g) \lesssim \dots$ (a rate in $p$, $N$ and $n$) if

• $(P_{ik})_{i=1,\dots,N,\,k=0,1}$ have the same covariance;

• $(P_{ik})_{i=1,\dots,N,\,k=0,1}$ have means satisfying a sparsity assumption;

• $f : i \mapsto \pi(P_X)[i]$ satisfies some smoothness assumptions (which are fulfilled in the case of the boundary fragment model as described in [11]).

(Recall that $p$ is the dimension of $\mathcal{X}$, $N$ is the number of pixels in the image, $n$ is the size of the learning set, and the notation $A \lesssim B$ means that there exists a constant $c > 0$ such that $A \le cB$.)

Foreseen applications. In satellite imagery, images often contain more than 200 spectral bands. In Mars satellite imagery (see [13]), geologists have a clear idea of what types of components they will find within images, and they can create a learning set. This learning set can be made out of samples from images that have been analyzed by an expert. However, this expert cannot identify spectra from the terabytes of data that flow from Mars to the Earth, and the proportions of the different components in a learning set taken from a randomly chosen place on Mars cannot be used to infer the proportions in a new image coming from another part of Mars.

In medical imagery of the brain the problem is also exploratory, but the number of images is relatively small, and although experts can analyze the images themselves, contamination by noise makes statistical support attractive. Images contain thousands of spectral bands.
The remote-sensing literature on supervised and unsupervised image segmentation procedures is very large; however, only a few procedures have been developed to tackle the multispectral and hyperspectral image segmentation problem. Finally, we are not aware of any work providing a theoretical assessment of an image (or hyperspectral image) segmentation procedure with a learning step.

Structure of the paper. This article is organized as follows. In Section 2 we give our main theoretical result, which concerns step 2 (segmentation step). In Section 3 we give an algorithm that aims at estimating the conditional densities under a Gaussian assumption when $\mathcal{X}$ is a high-dimensional space $\mathbb{R}^p$ with $p$ large. This algorithm gives a solution for step 1 and is shown to satisfy the conditions needed for step 2 to be consistent (i.e. Assumption 5 from Section 2). In Section 4 we apply the whole algorithm (step 1 plus step 2) to hyperspectral (medical and satellite) image segmentation. In Section 6 we give the proofs of our theoretical results.

2. Algorithm and main result 

In this section, we give the algorithm for step 2 (estimating the weights) and 
associated theoretical results. This can be considered as the main result of this 
paper. 

2.1. Mixlet estimation 

The mixlet algorithm was introduced by Kolaczyk et al. [9]. It is a model selection algorithm based on penalized maximum likelihood estimation.

Let $\mathcal{M}_N$ be a finite subset of $\times_{i=1}^{N} \mathrm{Conv}(\hat P_{i0}, \hat P_{i1})$ (i.e. a set of models) and $\mathrm{pen}_N : \mathcal{M}_N \to \mathbb{R}_+$ a penalty function. Note that the set $\mathcal{M}_N$ can be seen as a set of measures, as a set of weights, or (because it is finite) as a set of densities with respect to a dominating probability measure. The mixlet estimate of $\pi(P_X)$ is obtained by finding $\hat P_X$ given by

\[ \hat P_X = \arg\max_{Q \in \mathcal{M}_N} \left\{ \log Q(X) - 4\,\mathrm{pen}_N(Q) \right\}. \tag{5} \]

For the penalty function and the associated set of models, we only require a Kraft inequality:

Assumption 2. The set of models $\mathcal{M}_N$ and the associated penalty function $\mathrm{pen}_N$ satisfy

\[ \sum_{Q \in \mathcal{M}_N} e^{-\mathrm{pen}_N(Q)} \le C \]

for a positive constant $C$.

This type of assumption is standard in model selection theory (see [2]). It can be seen as a complexity assumption on the set of models and the penalty. Such an inequality then results from the Kraft inequality, as, for example, in Kolaczyk and Nowak [10]. Equivalently, it can be seen as a topological covering bound, as, for example, in Barron et al. [2]. The way we use this inequality in the proof of the following theorem is inspired by the work of Birgé [3].
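For a finite family of models, the penalized maximum likelihood criterion and the Kraft-type condition can be checked by direct enumeration. The sketch below is only an illustration: the log-likelihoods and penalties are made-up numbers, the function names are hypothetical, and the factor 4 in front of the penalty is the constant read from Equation (5) and from the proof in Section 6.2.

```python
import numpy as np

def select_model(log_liks, penalties, factor=4.0):
    """Index maximizing log Q(X) - factor * pen(Q) over a finite family."""
    scores = np.asarray(log_liks, dtype=float) - factor * np.asarray(penalties, dtype=float)
    return int(np.argmax(scores))

def kraft_sum(penalties):
    """Sum of exp(-pen(Q)) over the family, to be compared with the constant C."""
    return float(np.sum(np.exp(-np.asarray(penalties, dtype=float))))

# Three candidate models, from coarse (cheap) to rich (expensive).
log_liks = [-120.0, -112.0, -110.5]
penalties = [0.7, 1.4, 2.8]

best = select_model(log_liks, penalties)
total = kraft_sum(penalties)
print(best, total)
```

Here the intermediate model wins: the richest model has the best likelihood, but its gain does not cover the extra penalty.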

2.2. The "d-dimensional" hyperspectral image

In this example, we examine the choice of $\mathcal{M}_N$ and $\mathrm{pen}_N$ in a particular case related to a "d-dimensional" hyperspectral image.

Each index of $\{1, \dots, N\}$ is now associated with the center of one of the $N = n^d$ pixels of the d-dimensional hypercube $[0,1]^d$ (here we assume that the pixels of the image are d-dimensional hypercubes of equal size). As a consequence, giving a segmentation $g(X)$ of $X$ is equivalent to finding a particular partition of $[0,1]^d$ into groups of pixels. We search for those partitions within the set of recursive dyadic partitions. Recall that a recursive dyadic partition of $[0,1]^d$ (RDP$_d$ for short) is a partition constructed recursively and associated with a $2^d$-tree, i.e. a tree in which each node either has $2^d$ children or is a leaf. Splitting a node in the tree corresponds to splitting a d-dimensional hypercube into $2^d$ identical hypercubes.

The set of models $\mathcal{M}_N$ and the associated penalty function described here will be used in the numerical application to hyperspectral image segmentation. For $i = 1, \dots, N$, we search for $\pi(\hat P_X)_i$ in a regular grid of $[0,1]$ with $N^{3/2}$ elements (take the integer part of $N^{3/2}$ if it is not an integer). This grid of $[0,1]$ is denoted $G_N$. Finally, $\mathcal{M}_N$ is the set of product distributions $Q = \prod_{i=1}^{N} Q_i$ on $\mathcal{X}^N$ with $Q_i \in \mathrm{Conv}(\{\hat P_{i1}, \hat P_{i0}\})$, $\pi(Q)_i \in G_N$ for all $i = 1, \dots, N$, and $i \mapsto \pi(Q)_i$ constant on each piece of a given RDP$_d$. The minimal RDP$_d$ on which $\pi(Q)_i$ is constant is denoted $\mathcal{P}(Q)$. The function $\mathrm{pen}(Q)$ penalizes partitions that are too rich:

peniQ)^^ 1 - 1 Q log iV +^ log 2^ , (6) 

where $m^{d-1}$ is the number of elements of the partition $\mathcal{P}(Q)$. It is known that with this type of penalty we have a Kraft inequality

\[ \sum_{Q \in \mathcal{M}_N} e^{-\mathrm{pen}(Q)} \le 1 \tag{7} \]

(see for example [9]), and hence Assumption 2 is fulfilled.

The corresponding model selection algorithm (for step 2), i.e. the one used to find the maximizer in Equation (5) with the set of models and penalty function defined above, can be implemented efficiently with a pyramidal algorithm (see [11]) and has been called the mixlet algorithm.
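The pyramidal idea can be sketched for $d = 2$ as a recursion over dyadic squares: each square is either kept as one leaf with a constant weight, or split into four children, whichever gives the larger penalized log-likelihood. The Python sketch below is not the authors' implementation. It assumes a square image whose side is a power of two, per-pixel log-densities `l0` and `l1` supplied by step 1, a small weight grid, and a hypothetical constant penalty per leaf (`leaf_penalty`) standing in for the penalty of the paper.

```python
import numpy as np

def best_constant_weight(l0, l1, grid):
    """Best constant weight on one square and the log-likelihood it achieves."""
    # log(pi * p0 + (1 - pi) * p1) for every grid value and pixel, computed stably.
    with np.errstate(divide="ignore"):
        a = np.log(grid)[:, None] + l0.ravel()[None, :]
        b = np.log1p(-grid)[:, None] + l1.ravel()[None, :]
    ll = np.logaddexp(a, b).sum(axis=1)
    k = int(np.argmax(ll))
    return grid[k], float(ll[k])

def prune(l0, l1, grid, leaf_penalty):
    """Recursive dyadic pruning: returns (penalized score, weight map)."""
    n = l0.shape[0]
    pi, ll = best_constant_weight(l0, l1, grid)
    leaf_score = ll - leaf_penalty
    leaf_map = np.full((n, n), pi)
    if n == 1:
        return leaf_score, leaf_map
    h = n // 2
    split_score = 0.0
    split_map = np.empty((n, n))
    for rows in (slice(0, h), slice(h, n)):
        for cols in (slice(0, h), slice(h, n)):
            s, m = prune(l0[rows, cols], l1[rows, cols], grid, leaf_penalty)
            split_score += s
            split_map[rows, cols] = m
    # Keep the split only if it beats the single-leaf model.
    if split_score > leaf_score:
        return split_score, split_map
    return leaf_score, leaf_map

# 4 x 4 toy image: class 0 is likely on the left half, class 1 on the right half.
n = 4
left = np.arange(n)[None, :] < n // 2
l0 = np.where(left, 0.0, -10.0) * np.ones((n, 1))
l1 = np.where(left, -10.0, 0.0) * np.ones((n, 1))
grid = np.linspace(0.0, 1.0, 11)

score, weights = prune(l0, l1, grid, leaf_penalty=1.0)
print(score)
print(weights)
```

On this toy image the root is split once and the four quadrants are kept as leaves, so the estimated weight of class 0 is 1 on the left half and 0 on the right half.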
2.3. Theoretical result

Before stating the theoretical result, we give the assumptions that have to be fulfilled.

Assumption 3. There exists a positive constant $c$ such that $\inf_{1 \le i \le N} |P_{i0} - P_{i1}|_1 \ge c$, where $|P - Q|_1$ is the $L_1$ distance between $P$ and $Q$ (two distributions on $\mathcal{X}$) given by $|P - Q|_1 = \int_{\mathcal{X}} |dP - dQ|$.

Assumption 4. There exists a positive constant $c'$ such that $\inf_{1 \le i \le N} \min(\pi_i, 1 - \pi_i) \ge c'$.

Assumption 5. There exists $C > 0$ such that

\[ \forall k_1, k_2 \in \{0, 1\},\ i \in \{1, \dots, N\}, \quad \chi^2(P_{ik_1}, P_{ik_2}) \le C, \]

where $\chi^2(P_1, P_0)$ is the chi-square divergence between two probability distributions ($P_1$ and $P_0$) defined by

\[ \chi^2(P_1, P_0) = \begin{cases} \int \left( \frac{dP_1}{dP_0} - 1 \right)^2 dP_0 & \text{if } P_1 \ll P_0, \\ \infty & \text{otherwise.} \end{cases} \tag{8} \]
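As an illustration of the chi-square divergence, the Python sketch below evaluates it for discrete distributions and, separately, for two Gaussians sharing a diagonal covariance, where the standard closed form $\exp(\delta^T C^{-1} \delta) - 1$ with $\delta = \mu_1 - \mu_0$ applies. The function names are hypothetical and the discrete case is only a stand-in for the general definition.

```python
import numpy as np

def chi_square_divergence(p1, p0):
    """Sum of (p1/p0 - 1)^2 * p0 for discrete distributions; inf if p1 is not dominated by p0."""
    p1 = np.asarray(p1, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    if np.any((p0 == 0) & (p1 > 0)):
        return float("inf")
    support = p0 > 0
    ratio = p1[support] / p0[support]
    return float(np.sum((ratio - 1.0) ** 2 * p0[support]))

def chi_square_gaussian(mu1, mu0, var):
    """Closed form for N(mu1, diag(var)) against N(mu0, diag(var))."""
    delta = np.asarray(mu1, dtype=float) - np.asarray(mu0, dtype=float)
    return float(np.expm1(np.sum(delta ** 2 / np.asarray(var, dtype=float))))

print(chi_square_divergence([0.75, 0.25], [0.5, 0.5]))
print(chi_square_gaussian([1.0], [0.0], [1.0]))
```

In the Gaussian case the divergence is finite whenever the standardized distance between the means is finite, so a bound of the type required by Assumption 5 amounts to a bound on that distance.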



The obtained result is the following.

Theorem 1. Under the assumptions above, and if $g$ is a classification rule constructed with the two steps given in the preceding section (defined by Equation (3)), we have

\[ N S(g) \lesssim \psi_{N,n} = C_{N,n} + \inf_{R \in \mathcal{M}_N} h(R, N) \tag{9} \]

as long as $N e^{-c_0 N \psi_{N,n}} = O(\psi_{N,n})$, where $c_0$ is a positive constant, $S$ is the excess risk defined by Equation (4),

\[ \forall R \in \mathcal{M}_N, \quad h(R, N) = \|\pi(P_X) - \pi(R)\|_{l_1} + \mathrm{pen}_N(R), \]

and the error term related to the learning step is given by

'dPoidPii 



N 



dPn dPoi 



whe 



D i0 = max x 2 (P i0 , P%o), Ep i0 



Ai = max x 2 (Pa, Pa), E Pil 



Ao + Ai 



Pil — Pil 



(10) 



(11) 



Pil 



PiO — PiO 



P 



(0 



The proof of this result is postponed to the appendix. Let us discuss the assumptions of this theorem.

• In order to get a rate of convergence for the full process (step 1 and step 2), we need to bound $E[C_{N,n}]$ (where the expectation is with respect to the learning set). This is the purpose of Section 3.

• Assumption 4 is rather strong. If the number of pixels with a pure component ($\pi_i = 0$ or $1$) is small (i.e. of the order of the upper bound in the preceding theorem), the results are still valid. We think that the construction of $\mathcal{M}_N$ should be changed to avoid this assumption; in particular, the discretization of the set of values for $\pi_i$ should be refined near 0 and 1. This will be the purpose of further work.

• Assumption 4 is related to choosing a model that is adequate for the structure of $\pi(P_X)$. In the next section we explain how this choice can be made in the case of a "d-dimensional image".

2.4. Turning back to the "d-dimensional image"

In order to upper bound $\inf_{R \in \mathcal{M}_N} h(R, N)$ (tradeoff between bias and complexity), it is natural to introduce an assumption on the "spatial" regularity (i.e. regularity of the weights in the image) that can be handled by an RDP$_d$. This is done by the following definition and assumption.

Definition 1. Let $f : [0,1]^d \to \mathbb{R}$ be a piecewise constant function and $B(f)$ the set of points at which $f$ is not continuous. Let $N(f, r)$ be the minimal number of hypercubes of side length $r$ from an RDP that cover $B(f)$. To each $\beta > 0$, $M > 0$ we associate the set $\mathcal{CM}_d(\beta, M)$ of piecewise constant functions defined by

\[ \left\{ f : [0,1]^d \to \mathbb{R} : f \text{ piecewise constant},\ \|f\|_\infty \le M \text{ and } N(f, r) \le \beta r^{-(d-1)} \right\}. \]

Assumption 6 (d-dimensional regular image, $d \ge 2$). Let $PL_N$ be the regular partition of $[0,1]^d$ into $N$ identical hypercubes (i.e. $N$ pixels). For all $k \in \{1, \dots, K\}$, there exist $\beta > 0$, $M < \infty$ and $f_k \in \mathcal{CM}_d(\beta, M)$ (see Definition 1) such that

\[ \forall i, \quad \pi_{ik} = f_k(t_i), \tag{12} \]

where $t_i$ is the center of hypercube $i$ of $PL_N$.

Remark 2. This assumption is an assumption on the topological structure of $\pi$. This structure is more complex when $d$ is larger.

Proposition 1. With $\mathcal{M}_N$ and $\mathrm{pen}_N$ as defined previously, and under Assumption 6, we have

\[ \inf_{R \in \mathcal{M}_N} h(R, N) \le c N \left( \frac{\log N}{N} \right)^{1/d} \]

for a positive constant $c$.

The proof of this proposition can be found in Donoho [7] or in the appendix of Kolaczyk et al. [9].
of Kolaczyk et Al. [9]. 
Corollary 1. Let $d \ge 2$ and suppose that for all $i = 1, \dots, N$, $k = 0, 1$, we have $\hat P_{ik} = P_{ik}$. Under the assumptions of Theorem 1, and if $g$ is a classification rule constructed with the two steps given in the preceding section (defined by Equation (3)), with $\hat P_X$ given by Equation (5) and $(\mathcal{M}_N, \mathrm{pen}_N)$ as defined in this section with Assumption 6 fulfilled, then there exists a positive $c_0$ such that

\[ S(g) \le c_0 \left( \frac{\log N}{N} \right)^{1/d}, \]

where $S$ is the excess risk of segmentation defined by (4).

This corollary is a direct consequence of the preceding theorems.

Remark 3. This result, together with the one in the next section, may be seen as a complete description of the convergence of the algorithm used in Section 4. Unfortunately this is not the case, because the only applications we have are not in the case where the number of possible classes is two.



3. Handling the learning step 

3.1. Dimension reduction in segmentation: a solution to step 1 in 
high dimension 

In this section, we investigate step 1 when $\mathcal{X} = \mathbb{R}^p$ under the following assumption.

Assumption 7. For $k = 0, 1$, $i = 1, \dots, N$, $P_{ki}$ is Gaussian with mean $\mu_k$ and covariance $C$. For $k = 0, 1$, $C^{-}\mu_k$ has fewer than $p_0 + 1$ non-null components, where $p_0$ is bounded with respect to $p$, $n$ and $N$. The matrix $C$ is diagonal.

Note that, under this assumption, $P_{ki}$ does not depend on the position $i$.

For $k = 0, 1$, we suppose that we have $n_k$ independent random variables $Z_{kj}$ ($n_k$ is a positive integer) drawn from the distribution $P_{k1}$. The set $Z = \{Z_{kj},\ k = 0, 1,\ j = 1, \dots, n_k\}$ is the learning set. If $A$ is a square matrix, we use the notation $A^{-}$ for the associated generalised inverse.

The algorithm for estimating $P_{k1}$ ($k = 0, 1$), i.e. the learning step, is as follows.

1. For $i = 1, \dots, p$, compute $\hat\sigma^2[i]$, the unbiased empirical variance of $(Z_{kj}[i])_{k=0,1,\ j=1,\dots,n_k}$, and for $k = 0, 1$ compute $\hat\mu_k$, the empirical mean of $(Z_{kj})_{j=1,\dots,n_k}$.

2. Compute $I$ as

t . u i le{1 ,..., ry ,a^ > ^m\ 
3. The means $\mu_0$ and $\mu_1$ are estimated by



\ else A-". 1 . 



and the covariance C by the diagonal matrix C with diagonal elements 

CT-jJil if i G J . . 
else • = 1 '-.* 
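The learning step can be sketched in Python as follows. The selection rule and the handling of unselected coordinates are assumptions of this sketch, not a transcription of the formulas above: coordinates are kept when the standardized contrast between the two empirical means exceeds a hypothetical `threshold`, the variance is the pooled within-class unbiased estimate, and on unselected coordinates both class means are replaced by the pooled mean so that they cancel in the likelihood ratio.

```python
import numpy as np

def learn_sparse_gaussian(Z0, Z1, threshold):
    """Thresholded mean contrast with a diagonal covariance (illustrative sketch)."""
    Z0 = np.asarray(Z0, dtype=float)
    Z1 = np.asarray(Z1, dtype=float)
    n0, n1 = len(Z0), len(Z1)
    mu0_hat = Z0.mean(axis=0)
    mu1_hat = Z1.mean(axis=0)
    # Pooled within-class unbiased variance per coordinate (assumption of this sketch).
    ss = ((Z0 - mu0_hat) ** 2).sum(axis=0) + ((Z1 - mu1_hat) ** 2).sum(axis=0)
    var_hat = ss / (n0 + n1 - 2)
    # Keep coordinates whose standardized contrast exceeds the hypothetical threshold.
    contrast = np.abs(mu1_hat - mu0_hat) / np.sqrt(var_hat)
    selected = contrast > threshold
    # Off the selected set both classes share the pooled mean.
    pooled = (n0 * mu0_hat + n1 * mu1_hat) / (n0 + n1)
    mu0_tilde = np.where(selected, mu0_hat, pooled)
    mu1_tilde = np.where(selected, mu1_hat, pooled)
    return selected, mu0_tilde, mu1_tilde, var_hat

rng = np.random.default_rng(0)
Z0 = rng.normal(size=(50, 20))
Z1 = rng.normal(size=(50, 20))
Z1[:, :3] += 5.0  # only the first three coordinates separate the classes
selected, mu0_tilde, mu1_tilde, var_hat = learn_sparse_gaussian(Z0, Z1, threshold=2.0)
print(np.flatnonzero(selected))
```

With a sparse contrast, only a few coordinates enter the likelihood ratio, which is what makes the procedure usable when $p$ is much larger than the size of the learning set.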

Theorem 2. Let us take Mn ,periN as in subsection \2.4\ with Assumption 
fullfiled. Let us make Assumption^ and for i = 1, . . . , N k = 0, 1 compute Pki 
as a gaussian distribution with mean fik but assuming the covariance matrix C 
is known. Then under assumption QJ [^] and if g is a classification rule defined 
by Equation f3j) with Px in the first step, given by Equation J2J], we have: 



The proof of this theorem is given in Subsection 6.4. The weakness of this theorem is that it requires knowledge of $C$. We did not succeed in giving a proof in a more general case, and we believe that further improvement of this result is beyond the scope of this paper.



4. Application to hyperspectral image segmentation 

Before giving the details of our application to hyperspectral medical image segmentation, we have to emphasize that the theoretical results we gave are designed for two-class segmentation ($K = 2$). However, in most applications the number of possible classes is larger than two, and the algorithms we gave can easily be extended to the case $K > 2$. Indeed, the penalized maximum likelihood estimation of the weights $(\pi_i)_i$ can be used when $K > 2$, and the dimension reduction algorithm can be extended to a multiclass framework. This last extension can be done using a global measure of the contrast between groups.

4.1. Application to medical hyperspectral segmentation

Hyperspectral images of the brain from magnetic resonance imaging are high-dimensional data. These images have only a few pixels ($N = 256$ pixels) giving the detail of a slice of the brain, but on each pixel we observe a high-dimensional spectrum with $p = 1024$, hence $\mathcal{X} = \mathbb{R}^{1024}$. A given spectrum is expected to give complete information on the tissue characteristics at a given spatial position. These tissue characteristics can be classified into groups. In this medical problem, we have a learning set composed of 62 spectra from four different groups: 21 glioblastomas of type A, 9 glioblastomas of type B, 16 meningiomas, and 9 healthy tissues. We were given a hyperspectral image associated with a glioblastoma mixing type A and type B. This image (its spatial configuration) is simulated from spectra obtained in a real experiment. The obtained segmentation and the true segmentation are given in Figure 2. Our conclusion is that the tumor is well localized but that the different types of glioblastomas are not distinguished.



Figure 1. 10 × 10 square in the top left of the 16 × 16 hyperspectral image of a glioblastoma.



Note that although the results are positive, this is partly due to a pre-treatment of the data (ad hoc re-phasing of the spectra) and to the fact that we did not include any metastases in the problem (metastases and glioblastomas are hard to distinguish). Studying automatic re-phasing will be the purpose of further research.

Figure 2. Obtained segmentation (left) and the segmentation that we should obtain (right). Pixels colored in blue correspond to healthy tissue, green is associated with type B glioblastomas and red with type A glioblastomas.



5. Conclusion 

We studied the problem of supervised segmentation. We gave theoretical results in a plug-in perspective that allow a wide range of model selection procedures to be considered. We showed that the procedure of Kolaczyk et al. [9] (the mixlet procedure) can be applied consistently for the segmentation of images with smooth boundaries. We gave a theoretical result that separates the segmentation error due to the learning step from the segmentation error due to the segmentation step (estimation of the weights in the mixture model). We gave a dimension reduction procedure for the learning step, together with associated theoretical results. The corresponding result gives the convergence rate of the whole segmentation procedure (learning step plus segmentation step); this convergence rate is adapted to the case where the dimension $p$ of the feature space of a pixel observation is much larger than the number $n$ of elements in the learning set. Finally, we applied the whole methodology to medical image segmentation.

6. Theoretical results 

6.1. A general result in segmentation with a plugin rule 

Any segmentation procedure $g$ can be summarized by $\hat R_\pi : \mathcal{X}^N \to [0, \infty]^N$ through $g[i](X) = 1_{\hat R_\pi[i](X) \ge 1}$ for all $i \in \{1, \dots, N\}$. We obtained Theorem 3 below, which gives an upper bound on the excess risk $S(g)$ under the following assumption on the error made while estimating $R_\pi$:

Assumption 8. There exist $c_0, c_1, c_2 > 0$ and a sequence $\psi_N$, with $N e^{-c' N \psi_N} = O(\psi_N)$ for any $c' > 0$, such that $\hat R_\pi : \mathcal{X}^N \to [0, \infty]^N$ and $R_\pi = (R_\pi[i])_{i=1,\dots,N}$ satisfy

\[ \forall \delta > 0, \quad P_X\left( \mathcal{E}(\hat R_\pi, R_\pi) \ge \delta \right) \le c_2 e^{c_0 N \psi_N - c_1 \delta}, \tag{13} \]

where $\mathcal{E}$ is given by

\[ \forall x, y > 0, \quad \mathcal{E}(x, y) = \sum_{i=1}^{N} \Omega^2(x_i / y_i) \tag{14} \]

and

\[ \forall x \ge 0, \quad \Omega(x) = \frac{|x - 1|}{2(x + 1)}. \tag{15} \]

We also need the following additional assumption.

Assumption 9. There exists $c > 0$ such that

\[ \forall i \in \{1, \dots, N\}, \quad E[\Omega(R_\pi[i])] \ge c. \tag{16} \]

Theorem 3. Let us take $\hat R_\pi : \mathcal{X}^N \to [0, \infty]^N$ and $g$ the associated segmentation procedure with, for all $i \in \{1, \dots, N\}$, $g[i] = 1_{\hat R_\pi[i] \ge 1}$. Take $R_\pi$ and the associated optimal segmentation procedure $g^*$ as defined by Equation (1). Under Assumption 9, and if $\hat R_\pi$ satisfies Assumption 8, then

\[ S(g) \lesssim \psi_N, \]

where $S$ is the excess risk defined by Equation (4).

This theorem is proven in the appendix. Assumption 9 is a weak assumption that will be satisfied if $P_{i0}$ and $P_{i1}$ are not arbitrarily close (uniformly in $i \in \{1, \dots, N\}$) for a well-chosen distance. Notice that Assumptions 3 and 4 imply Assumption 9.

The way to obtain Inequality (13) in Assumption 8 is the topic of Subsection 6.2, but the theory developed by Birgé in [3] is our main reference and inspiration on the topic.

To understand the interest of Theorem [3] one should notice that a simple 
analysis gives 

N 
i=l 

On the other hand, it is possible (using same argument that are used in the 
proof of Corollary [2]) to show that Assumption [8] implies 

E^iR^/R^] < il> N 

which gives 



S{g) < Vi> 



N • 
However, Assumption 8 is weaker than

and this shows that Theorem 3 is sharper than Equation (17).
Proof. To simplify notation we will set 

N 

V nc R-Jin/x. no = 

N 



M = ^E ( i ^W)^' DC = J2u h (18) 

i=l i=l 

V/ C {1, . . . ,N} c(I) = i^E[n(^[i])], and M k = M1 DC , . 



The proof of the theorem is decomposed into 3 steps.
Step 1: we claim that there exists $1 > c > 0$ such that



P ( sup V Q,{R^[i))Ui < ck ) < e~ 2ck . (19) 

(where the supremum is taken over all subsets of $\{1, \dots, N\}$ of size $k$). Indeed, we can notice that we have, for all $I \subset \{1, \dots, N\}$ of cardinality $k$,

P 1 5^n(Jk[i]) < c(J)fc/2] < P 1 5^n(i^[i]) -5^E[n(i^[i])] < -c(J)fc/2) 
Vie/ / Vie/ te/ / 



e CO* 

< e 2— 



where this last inequality results from the bounded difference inequality. Also, 
setting infj.m = k c(J) = 2c (c > from Assumption ??) gives Inequality 

Step 2: we claim that, with $c_0, c_1 > 0$ and $\psi_{N,n}$ as in Assumption 8, we have
P ( sup J" n(R n [i\)Ui > ck) < e c o^iv, n - C ic 2 /c ( 20 ) 
The Cauchy-Schwarz inequality gives, for all $I \subset \{1, \dots, N\}$ of cardinality $k$,

N 

< kY,^ 2 (Rn[i\)Ui. 



iei / vie/ 



i=l 
which implies:

P ^sup 'Y^£l{R ir [i])U i > ckj < P x ^^0 2 (iMi])Z7j > c2fc ^j • 

Finally, Inequality (20) follows from Assumption 8 and the fact that, for any $i \in \{1, \dots, N\}$, $U_i = 1$ implies that $\Omega(R_\pi[i]) \le \Omega(\hat R_\pi[i] / R_\pi[i])$.

Step 3: Standard calculous (see for example [6] noticing that Q(ii 7r [i]) = 
— 1/2|) leads to 



E 



" N 

E 



l§(X)[i]7^ ~ lg*( X ){i]=£Y,\X 



= 2M. 



Also, we have 

E [M] < + y M 

C\C — ' 

<^±l N ^ Nn+ y sup Vo^w/iii]) 

Ar> fc >f0^jv V , N?i |i| -' t ie/ 

— — C1 c^ 

< ^±1n^ n „ + Y keCoN^n-h + ke -^ k 

C\C — 4 

N>k>S0±±Nib N n 

C\C Z 

where these last two inequalities follow from the results of steps 1 and 2. Since $N e^{-c' N \psi_{N,n}} = O(\psi_{N,n})$ for any $c' > 0$, this gives the desired result. □



6. 2. A general oracle inequality for penalized maximum likelihood 
estimation in mixlet model 



In this subsection, we give a general result on the estimation of the weights. We first define the mean Hellinger distance between product measures.

Definition 2. If $P = \prod_{i=1}^{N} P_i$ and $Q = \prod_{i=1}^{N} Q_i$ are two product distributions on $\mathcal{X}^N$, we call mean Hellinger distance, denoted $H_N$, the positive quantity defined by

\[ H_N(P, Q)^2 = \frac{1}{N} \sum_{i=1}^{N} h^2(P_i, Q_i), \tag{21} \]

where $h^2(P_i, Q_i) = \int_{\mathcal{X}} \left( \sqrt{dP_i} - \sqrt{dQ_i} \right)^2$ is the squared Hellinger distance between $P_i$ and $Q_i$.
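For discrete factors, the mean Hellinger distance of Definition 2 can be computed directly. The following Python sketch uses the same normalization as the definition (no factor 1/2 in the squared Hellinger distance); the function names and the toy product measures are illustrative.

```python
import numpy as np

def squared_hellinger(p, q):
    """Sum of (sqrt(p) - sqrt(q))^2 for two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def mean_hellinger_sq(P, Q):
    """Average of the squared Hellinger distances over the N factors."""
    if len(P) != len(Q):
        raise ValueError("product measures must have the same number of factors")
    return float(np.mean([squared_hellinger(p, q) for p, q in zip(P, Q)]))

# Two product measures with N = 2 factors: identical on the first, disjoint on the second.
P = [[0.5, 0.5], [1.0, 0.0]]
Q = [[0.5, 0.5], [0.0, 1.0]]
value = mean_hellinger_sq(P, Q)
print(value)
```

With this normalization each factor contributes at most 2, so the mean squared distance lies in $[0, 2]$.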
Theorem 4. Suppose that $\mathcal{M}_N$ satisfies Assumption 2 and that $\hat P_X$ is given by Equation (5). Then, under Assumption 5, there exist $c', c'' > 0$ such that

\[ \forall \delta > 0, \quad P\left[ N H_N^2(\hat P_X, P_X) \ge \delta \right] \le \exp\left( c' \inf_{R \in \mathcal{M}_N} \{h(R, N)\} + c'' C_{N,n} - \delta/4 \right), \]

where $h$ is given by Equation (10) and $C_{N,n}$ by Equation (11).

Before giving the proof of this theorem, let us make some comments. The result of this theorem is an oracle inequality aimed at verifying Assumption ??, as noticed in Subsection 2.4. The function $h(R, N)$ measures the tradeoff between bias and complexity for model $R$. The error term $\sum_{i=1}^{N} \chi^2(\hat P_{i1}, P_{i1}) + \chi^2(\hat P_{i0}, P_{i0})$ is related to step 1 but should also be connected to Remark ??.

Assumption 5 is necessary to obtain theoretical results. It is weaker than the assumption

sup 7frW<B (22) 

which is common in mixture model estimation (see for example the thesis of Li [12], or the work of Kolaczyk et al. [9]). Note that the assumption given by (22) is not satisfied when the $P_{ik}$ are Gaussian. Our Assumption 5 allows us to consider Gaussian mixtures.

Kolaczyk et al. [9] introduced the idea of mixture weight estimation by maximum likelihood. In the same paper, they give a theoretical result without using the mean Hellinger distance, which weakens their result. In addition, they consider only the case where $\hat P_{ik} = P_{ik}$ with $d = 2$ and use the assumption related to Equation (22). From this point of view, our result (Theorem 4 together with Proposition 1) is a significant improvement on the result obtained in [9]. Indeed, for all $i = 1, \dots, N$, $k = 0, 1$, under the assumption given by Equation (22) and Assumption 6 with $d = 2$, there exists a positive constant $c_0$ such that



E 



H N (Px,P) 



< c 



log N N 1/2 
N 



We did not succeed in using this last bound to obtain a result in the segmentation problem (such as Theorem 1). This is the reason why we worked on obtaining stronger results such as Theorem 4 and its consequences.

The result of the preceding theorem may be difficult to apprehend, so we give the following simple corollary (it is a weaker result).

Corollary 2. Let q > 1. Under the assumption of the preceding theorem, there 
exists a positive cq such that 



E[H%(P,P X )} < ( inf {h(R,N)} + C N ,„ 
Proof of Theorem 4

Proof. The proof relies on the same principle as the one exposed by Birgé in [3]. More precisely, the density $\hat P_X$ is a penalized maximum likelihood estimator, but it is also a T-estimator. As a consequence, we have



Px(NH 2 N (P x ,P x )>6) 
<Px{3QeM N :NH 2 N (Px,Q)>6 

Q(x) 



(23) 



and Vi? € Mn log 
< Y, Px(NH 2 N (P x ,Q)>5 

Q£M N 



R(X) 



> 4(pen N (Q) - pen N (R)) 



andyReM N log ^7^| > 4(pen N (Q) -pen N (R)) 
R{X) 

(from the sub-additivity of probability measures). In addition, for all $Q \in \mathcal{M}_N$, the Markov inequality leads to

Px ( NH 2 N (P Xl Q) > S andVi? e M N log ^§ > 4{pen N (Q) - pen N {R)) 



< 



>NH 2 N (P X ,Q)>5 



E 



( Q(x) \ 



R(X) 

1/4 



-penN (Q)+penjv (-R) 



(24) 



For all $R, Q \in \mathcal{M}_N$, by applying the Cauchy-Schwarz inequality twice, we have:



E 



[ Q(x) \ 

\R(X)J 



1/4 



< E 



Q(x) 

Px(X) 



1/2 



-,1/2 



Px(X) 
R+(X) 



1/4 



E 



( R + (X) \ 
V R(X) J 



1/4 



(this equation defines $A$, $B$ and $C$) where

\[ R^{+} = \prod_{i=1}^{N} \left( \pi_i(R) P_{i0} + (1 - \pi_i(R)) P_{i1} \right). \]

We first give an upper bound for $A$:

This bound is easy to obtain by using the standard inequality



Vi - 1, 



,7V E/ 



Q«(*[i]) \ 1/a 



< e" 



(25) 



Pxi(X\i]), 

Equations (24), (25) and (23) and Assumption 2 (Kraft inequality) then give
Vi? € Mjv, P x (NH 2 N (P Xl Px) >5)< e ^»(-R)+l°6(-B(«)c(R)) e -|. 
We now only need to show that

\[ \log(B(R)) \le c' h(R, N) \tag{26} \]

($h(R, N)$ is given by Equation (10)) and

\[ \log(C(R)) \le c'' C_{N,n} \tag{27} \]

with $c' > 0$ and $c'' > 0$.



Let us begin with Equation (26). Easy calculus (using in particular the concavity of $x \mapsto x^{1/2}$) leads to



1 W / 

log (P(P)) < - £ log 1 + (tt^Px) - 7r,(P))E 



PiO ~ Pil 



(X\i)) 



(28) 



On the other hand, Assumption |4] easily gives 



E 



Pjo — Pa 
Ri 



< 



(for a positive constant $c$) which gives, using Equation (28) and the inequality $\log(x + 1) \le x$ for all $x > -1$, Equation (26).

Lemma 1. Let $P$ and $Q$ be two equivalent probability measures. We then have



sup Eq 

ase[0,l] 



aepa] EQ \_xQ + {1- x)P 

Q 



< l 



<max(l, X 2 (Q,P)) + 1 



xQ + (1 - x)P 

Let now P and Q be two other equivalent measures we then have 
' xQ + (1 - x)P 



sup IsLq 
xe[o,i] 



xQ+(l -x)P 



<max X (<3,<2) + LEq 



p-p 






*■) 


p 





The proof of this lemma is only simple variational analysis (all the functions of $x$ that appear have their maximum at 0 or 1), and the use of the identity



P 



x 2 (P,Q) + i. 



We now show Equation (27). We have

$$\log(C(R)) \le \frac{1}{4} \sum_{i=1}^N \log E\left[\frac{\pi_i(R) P_{i0} + (1 - \pi_i(R)) P_{i1}}{\pi_i(R) \hat P_{i0} + (1 - \pi_i(R)) \hat P_{i1}}\right]$$
and this, together with the third equation of the preceding Lemma, gives:

$$\log(C(R)) \le \frac{1}{4} \sum_{i=1}^N \log\Big(\big(\pi_i(P_X) D_{i0} + (1 - \pi_i(P_X)) D_{i1}\big) + 1\Big),$$

with, for all $i = 1, \dots, N$,

D i0 = max ^x 2 (Pi , P i0 ),E Pza 

and 



Ai =max x 2 (^i,Ai),E Pil 



Pil — Pil 

Pa 

PiO — PiO 



PiO 



Using the inequality $\max(a, b) \le a + b$ for all nonnegative a, b gives the desired result
(i.e. Equation (27)). □

6.3. Proof of Theorem

The aim of this proof is to use Theorem 3. First, one should notice that Assumptions 3 and 4 imply Assumption 5. Let us define $(R_\pi[i])_{i=1,\dots,N}$ with

TTi dP i0 

The following lemma gives a kind of triangle inequality; it follows
from simple analysis.

Lemma 2. There exists $c > 0$ such that, for all $x, y > 0$,

$$\Omega(xy) \le c\,\big(\Omega(x) + \Omega(y)\big).$$
Using the Lemma with x = R n /R n and y = R^/R^ gives 

n 2 (ik [<]/£*[*]) < c fn 2 ^]/^]) + o 2 (p w [i]/p T [i]) 



Also, because $P(X + Z > \delta) \le P(X > \delta/2) + P(Z > \delta/2)$ for any real-valued
random variables X, Z, we have



VS > P x [£{R^,R*) >5)<P X (£(R„,Rk) > S/2)+P x J2») > <J/2 

(29) 
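The splitting inequality used here, P(X + Z > delta) <= P(X > delta/2) + P(Z > delta/2), holds because the event {X + Z > delta} is contained in the union of {X > delta/2} and {Z > delta/2}. A minimal sketch that checks it exactly on two small discrete laws follows; the laws are taken independent only to make the joint distribution easy to enumerate (the inequality itself needs no independence), and all names and values are illustrative.

```python
from itertools import product

def split_bound(xs, zs, delta):
    """xs, zs map values to probabilities (independent laws for simplicity).
    Return (P(X + Z > delta), P(X > delta/2) + P(Z > delta/2))."""
    lhs = sum(px * pz
              for (x, px), (z, pz) in product(xs.items(), zs.items())
              if x + z > delta)
    rhs = (sum(px for x, px in xs.items() if x > delta / 2)
           + sum(pz for z, pz in zs.items() if z > delta / 2))
    return lhs, rhs

# Toy laws (illustrative values only)
law_x = {0.0: 0.5, 1.0: 0.3, 3.0: 0.2}
law_z = {0.0: 0.6, 2.0: 0.4}

for delta in (0.5, 1.0, 2.0, 3.0, 4.5):
    lhs, rhs = split_bound(law_x, law_z, delta)
    print(f"delta={delta}: {lhs:.2f} <= {rhs:.2f}")
```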
We first bound the first term of the right-hand side of Inequality (29). Using
Assumption 0] gives 

tfiR^/R^]) < d\n(P x )\i] -m\ 2 

(for a positive constant c using Assumption [4]) 

< / h 2 (P i0l Pji) 
~ ° \Pio-Pa\t 

(from Le Cam's inequality)

< c"/i 2 (P i0 ,Pa) 
(from Assumption 3).

Also, 

N 

^0 2 (iMi]/iUi]) < c"NH N (P x ,P x ) (30) 

i=l 

and we can use Theorem 0] to conclude that 

P x (£(R w ,R w )>8y (31) 

<expjc' inf {h(R 7 N)} + c" C N „ - 5/4 1 , 
l_ ReM N J 

where $h$ is given by Equation (10).

Let us now bound the second term of the right-hand side of Inequality (29).
To simplify notation, we set



N 

e = 

i=l 



The finite difference inequality implies that 

V< > P n 2 (Rir[i]/R\t[i]) > e + tj < 
Also, taking $t = (\delta/2 - e)_+$, where $(x)_+$ stands for the positive part of x, gives

/ N \ 



\/s>2e p[J2 ^ 2 {RM/RM) >S/2\< 



2(5/2- 

e n 



\i=l / 

Finally, because e < iV0jv,M> there exists co,ci,C2 > such that 
VS > P x (£{Rk,R-k) >5)< c 2 e CoN ^-^ s , 

This inequality and Inequality (31) imply that Assumption 5 of Theorem 3
is fulfilled, which ends the proof.
6.4. Proof of Theorem 2



First notice that the subscript i in Theorem 2 does not play any role; we will
choose i = 1 and omit the corresponding subscript in the rest of the proof (in
particular $g^*(x)$ will stand for $g^*(x)[1]$). Because S(g) is upper bounded by 1,
we only need to upper bound separately the 3 terms that substantially appear
in $E[\min(C_{N,n}/N, 1)]$:



Ei =E 













E 


min ^1, E 




I 


, E 2 =E 




. \dPj_ 







iin(l,x 2 (Pi,Pi)) 



Ba =E 



min 1, Ep 



'Pl-Pl 


)] 


Pi 





where the last expectations in $E_1$, $E_2$ and $E_3$ are with respect to the learning
set.

Upper bound for $E_3$. A simple calculation gives



Pi -Pi 
Pi 



Ef 



1 



(32) 



where 



C(x) 



1 /„_- f , Ai +Mi 



Setting $\xi = C^{-1/2}(X - \mu_0)$ in Equation (32) gives



E, 



E 



where $\xi$ is a Gaussian random variable with mean zero and covariance $I_p$ and
$\tilde{\mathcal{L}}(x) = \mathcal{L}(C^{1/2} x + \mu_0)$. We now use Corollary 1.7.9 in [4], which gives



E 



3 £(£)-E[£(£)] 



<E[< 



and implies 



min(Ep 



- 1 



1) < \\c- 1/2 (m AOfe + ( cr-Oix - AO,mo - 



< ||C 1/2 (mi - Ai)IIep + (C (Mi - Ai),Mo - Mi), 
Because we assumed that C is known, we have 

C- 1 / 2 ( Ml -/i 1 )=C- 1 / 2 /xi-^(0 



where S H ■ K p -> M p is the hard threshold operator with threshold y2^2SM 

and £ is a gaussian R p random vector with mean C -1 / 2 /ii and variance —I p . 
Also, from Donoho and Johnstone [8], we can show that 



E 



min(Ep 



< 



log(p) 
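For readers unfamiliar with the hard threshold operator used in this step, a minimal sketch is given below: each coordinate is kept if its absolute value exceeds the threshold and set to zero otherwise. The threshold in the example is the universal threshold of Donoho and Johnstone for noise variance 1/n, namely sqrt(2 log(p) / n); whether this is exactly the threshold used in the paper is an assumption, and the function names, the sample vector and the values of p and n are hypothetical.

```python
import math

def hard_threshold(v, t):
    """Hard thresholding: keep coordinates with |x| > t, set the others to 0."""
    return [x if abs(x) > t else 0.0 for x in v]

def universal_threshold(p, n):
    """Universal threshold sqrt(2 log(p) / n) for noise variance 1/n
    (assumed choice, used here only as an example)."""
    return math.sqrt(2.0 * math.log(p) / n)

# Illustrative noisy coefficient vector in dimension p = 5 with n = 50
xi = [0.05, -1.20, 0.30, 0.00, 0.90]
t = universal_threshold(p=len(xi), n=50)
print(t, hard_threshold(xi, t))
```

With this choice the small coordinates, which are indistinguishable from noise at level 1/n, are removed, while the large ones are left unchanged.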
Upper bound for $E_1$. The upper bound for $E_1$ easily follows from the fact that



Q 2 (dP 1 /dP 1 ) 



<E 



log 2 (dP 1 /dP 1 



<E [C 2 (X)\ 

Upper bound for $E_2$. It follows directly from the upper bound for $E_3$ since



X 2 (P 1 ,P 1 )=E Pl 



1. 



This ends the proof. 



6.5. Proof of Corollary 2

Proof. We only have to use Proposition 3 of Birgé [3]:
Lemma 3. Let Y be a positive random variable with 

P(Y > y) < ae~ y for y > y and a > 0. 

Then, for all q > 1, 

W q ]<y q {i + aUy)), 

where $\zeta_q$ is a function defined on $\mathbb{R}_+$, decreasing and such that

Vx > cq, Cq(x) = ^e _I , where c— 1/2 if q < 2ire and 0.612 otherwise . 

We apply the preceding Theorem to check the hypothesis of the Lemma
with

y 2 = 2(c' inf {g(R,N)} + c"C N , n 
a = e3 , Y 2 — NH 2 ^(Px,P)- As a consequence, when y > (cq) 2 we have 

a( q (y) < \e-y' 2 , 

which leads to the desired result (for N large enough, and because < 2, for 
all N by changing the constant). □ 